Text and Hypertext Categorization
نویسندگان
چکیده
Automatic categorization of text documents has become an important area of research in the last two decades, with features that make it significantly more difficult than the traditional classification tasks studied in machine learning. A more recent development is the need to classify hypertext documents, most notably web pages. These have features that add further complexity to the categorization task but also offer the possibility of using information that is not available in standard text classification, such as metadata and the content of the web pages that point to and are pointed at by a web page of interest. This chapter surveys the state of the art in text categorization and hypertext categorization, focussing particularly on issues of representation that differentiate them from 'conventional' classification tasks and from each other.
منابع مشابه
Classification Techniques for Categorization of Hypertext Documents
In this paper we investigate techniques for categorization of hypertext documents. Recent years have witnessed a growing interest in applying text categorization techniques to the Web. However, the semi-structured nature of the Web along with diverse subject matter present in it pose interesting challenges for conventional classification techniques. In this paper, we review some of the techniqu...
متن کاملImpact on Performance of Hypertext Classification of Selective Rich HTML Capture
Hypertext categorization is the automatic classification of web documents into predefined classes. It poses new challenges for automatic categorization because of the rich information in a hypertext document. Hyperlinks, HTML tags, and metadata all provide rich information for hypertext categorization that is not available in traditional text classification. This paper looks at (i) what represe...
متن کاملNeighbourhood Exploitation in Hypertext Categorization
As the web expands exponentially, the need to put some order to its content becomes apparent. Hypertext categorization, that is the automatic classification of web documents into predefined classes, came to elevate humans from that task. The extra information available in a hypertext document poses new challenges for automatic categorization. HTML tags and linked neighbourhood all provide rich ...
متن کاملTowards Structure-sensitive Hypertext Categorization
Hypertext categorization is the task of automatically assigning category labels to hypertext units. Comparable to text categorization it stays in the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteri...
متن کاملTowards Logical Hypertext Structure A Graph-Theoretic Perspective
Facing the retrieval problem according to the overwhelming set of documents online the adaptation of text categorization to web units has recently been pushed. The aim is to utilize categories of web sites and pages as an additional retrieval criterion. In this context, the bagof-words model has been utilized just as HTML tags and link structures. In spite of promising results this adaptation s...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2009